The assignment has the following requirements:
- What is the question?
- What did you do?
- How well did it work?
- What did you learn?

The task of the assignment is to combine all the teachings from the previous assignments into one cohesive code article.
The data used in this article is the IBM HR Analytics dataset from Kaggle. You can use the following link to download the data from my repository.
The data has the columns shown in the data.info() output below.
!pip install h2o
!pip install shap
!pip install seaborn
import seaborn as sns
import shap
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import h2o
from h2o.automl import H2OAutoML
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch
from sklearn.model_selection import train_test_split
### Reading data from the github repository
data = pd.read_csv('https://raw.githubusercontent.com/TarushS-1996/DataScience_001067923/main/IBMHRAttritionDataset.csv')
### Getting the data types
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Age                       1470 non-null   int64
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64
 6   Education                 1470 non-null   int64
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64
 9   EmployeeNumber            1470 non-null   int64
 10  EnvironmentSatisfaction   1470 non-null   int64
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64
 13  JobInvolvement            1470 non-null   int64
 14  JobLevel                  1470 non-null   int64
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64
 19  MonthlyRate               1470 non-null   int64
 20  NumCompaniesWorked        1470 non-null   int64
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64
 24  PerformanceRating         1470 non-null   int64
 25  RelationshipSatisfaction  1470 non-null   int64
 26  StandardHours             1470 non-null   int64
 27  StockOptionLevel          1470 non-null   int64
 28  TotalWorkingYears         1470 non-null   int64
 29  TrainingTimesLastYear     1470 non-null   int64
 30  WorkLifeBalance           1470 non-null   int64
 31  YearsAtCompany            1470 non-null   int64
 32  YearsInCurrentRole        1470 non-null   int64
 33  YearsSinceLastPromotion   1470 non-null   int64
 34  YearsWithCurrManager      1470 non-null   int64
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
### Checking for missing values
data.isnull().sum()
### Encoding binary columns as 1 and 0 without increasing the dimensionality of the dataset
one_hot = {'Yes': 1, 'No': 0, 'Y':1, 'N':0, 'Male': 0, 'Female': 1}
data.Attrition = [one_hot[item] for item in data.Attrition]
data.OverTime = [one_hot[item] for item in data.OverTime]
data.Over18 = [one_hot[item] for item in data.Over18]
data.Gender = [one_hot[item] for item in data.Gender]
### Using pd.get_dummies() to create one-hot encoding where the data type was Object
data = pd.get_dummies(data, columns = ['BusinessTravel', 'Department', 'MaritalStatus', 'EducationField'])
data = data.drop(['EmployeeCount', 'HourlyRate', 'DailyRate', 'Over18', 'StandardHours', 'JobRole', 'EmployeeNumber'], axis = 1)
### Specifying data type as int64
data = data.astype('int64')
### Making sure data type is correctly changed
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 39 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Age                                1470 non-null   int64
 1   Attrition                          1470 non-null   int64
 2   DistanceFromHome                   1470 non-null   int64
 3   Education                          1470 non-null   int64
 4   EnvironmentSatisfaction            1470 non-null   int64
 5   Gender                             1470 non-null   int64
 6   JobInvolvement                     1470 non-null   int64
 7   JobLevel                           1470 non-null   int64
 8   JobSatisfaction                    1470 non-null   int64
 9   MonthlyIncome                      1470 non-null   int64
 10  MonthlyRate                        1470 non-null   int64
 11  NumCompaniesWorked                 1470 non-null   int64
 12  OverTime                           1470 non-null   int64
 13  PercentSalaryHike                  1470 non-null   int64
 14  PerformanceRating                  1470 non-null   int64
 15  RelationshipSatisfaction           1470 non-null   int64
 16  StockOptionLevel                   1470 non-null   int64
 17  TotalWorkingYears                  1470 non-null   int64
 18  TrainingTimesLastYear              1470 non-null   int64
 19  WorkLifeBalance                    1470 non-null   int64
 20  YearsAtCompany                     1470 non-null   int64
 21  YearsInCurrentRole                 1470 non-null   int64
 22  YearsSinceLastPromotion            1470 non-null   int64
 23  YearsWithCurrManager               1470 non-null   int64
 24  BusinessTravel_Non-Travel          1470 non-null   int64
 25  BusinessTravel_Travel_Frequently   1470 non-null   int64
 26  BusinessTravel_Travel_Rarely       1470 non-null   int64
 27  Department_Human Resources         1470 non-null   int64
 28  Department_Research & Development  1470 non-null   int64
 29  Department_Sales                   1470 non-null   int64
 30  MaritalStatus_Divorced             1470 non-null   int64
 31  MaritalStatus_Married              1470 non-null   int64
 32  MaritalStatus_Single               1470 non-null   int64
 33  EducationField_Human Resources     1470 non-null   int64
 34  EducationField_Life Sciences       1470 non-null   int64
 35  EducationField_Marketing           1470 non-null   int64
 36  EducationField_Medical             1470 non-null   int64
 37  EducationField_Other               1470 non-null   int64
 38  EducationField_Technical Degree    1470 non-null   int64
dtypes: int64(39)
memory usage: 448.0 KB
None
from statsmodels.graphics.gofplots import qqplot
data_col = data[['Age', 'DistanceFromHome', 'EnvironmentSatisfaction', 'JobSatisfaction', 'Gender', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating', 'StockOptionLevel', 'TotalWorkingYears', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'Attrition']]
for c in data_col.columns:
    fig, ax = plt.subplots(figsize=(8, 5))
    qqplot(data_col[c], line='45', fit=True, ax=ax)
    ax.tick_params(labelsize=13)
    ax.set_xlabel("Theoretical quantiles", fontsize=15)
    ax.set_ylabel("Sample quantiles", fontsize=15)
    ax.set_title("Q-Q plot of {}".format(c), fontsize=16)
    ax.grid(True)
    plt.show()
### Getting the count of class samples and making sure there are almost equal samples for various classes.
print("Job satisfaction Count: ")
print(data.JobSatisfaction.value_counts())
Job satisfaction Count:
4    459
3    442
1    289
2    280
Name: JobSatisfaction, dtype: int64
As we can see above, JobSatisfaction has a different number of examples per class. Models tend to perform better when the training data has a similar number of samples per class, and may otherwise overfit the majority classes.
From the above count we can see the classes and their sample sizes:
| class | Sample |
|---|---|
| 1.0 | 280 |
| 2.0 | 289 |
| 3.0 | 442 |
| 4.0 | 459 |
there is a clear difference between the counts for classes 1.0 and 2.0 versus 3.0 and 4.0.
Thus we can assume that regularization might help reduce the resulting high variance.
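One way to act on this imbalance (a hedged sketch, not part of the original pipeline) is to derive inverse-frequency class weights from the counts above; many estimators accept such weights so that rarer classes are not drowned out during training:

```python
import pandas as pd

# Class counts taken from the JobSatisfaction value_counts() output above
counts = pd.Series({1: 289, 2: 280, 3: 442, 4: 459})

# Inverse-frequency weighting: rarer classes get proportionally larger weights
weights = counts.sum() / (len(counts) * counts)
print(weights.round(3))
```

Classes 1 and 2 end up with weights above 1 while classes 3 and 4 fall below 1, which is exactly the correction the imbalance calls for.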
Next, we plot a correlation heatmap and inspect the values.
### Setting up the figure size
plt.figure(figsize=(30,10))
### Specifying the type of plot, data, colormap and annotation using the sns library
sns.heatmap(data.corr(), annot=True, cmap='RdYlGn')
From the above correlation matrix we can remove some of the features, as they appear highly correlated with each other, and then replot the heatmap of the correlation matrix.
data = data.drop(['JobLevel', 'MonthlyIncome', 'PerformanceRating', 'TotalWorkingYears'], axis = 1)
### Replotting the heatmap to see if any other value can be removed.
plt.figure(figsize=(30,10))
sns.heatmap(data.corr(), annot=True, cmap='RdYlGn')
plt.figure(figsize=(50, 20))
sns.boxplot(data=data)
As we can see, the data needs normalization.
from sklearn import preprocessing
### Min-max scaling the wide-range numeric columns so all features share a comparable scale
dataScaled = data.copy()
min_max_scaler = preprocessing.MinMaxScaler()
for col in ['MonthlyRate', 'Age', 'DistanceFromHome']:
    dataScaled[col] = min_max_scaler.fit_transform(dataScaled[[col]])
plt.figure(figsize=(50, 20))
sns.boxplot(data=dataScaled)
As we can see, our current data has outliers as well, which might have some impact on model performance. We could use regularization for our model.
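As a hedged sketch of the regularization idea mentioned above (synthetic data and an illustrative alpha, not the notebook's actual model), ridge regression adds an L2 penalty that shrinks coefficient magnitudes relative to plain least squares, damping the influence of outliers and correlated features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([1.0, 0.5, 0.0, 0.0, -0.5])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha chosen only for illustration

# The L2 penalty guarantees the ridge coefficient vector is no larger than OLS's
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```

In practice alpha would be tuned by cross-validation rather than fixed by hand.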
data = dataScaled
Now we can fit the data to a linear regression model and predict employee attrition.
To do this we will specify the x and y values; for x, we drop the column that is our y.
### Specifying the y value
y = data.Attrition
### Specifying the x value by dropping Attrition from the data
x = data.drop(['Attrition'], axis = 1)
### Checking if x has the right columns present
print(x.columns)
### Checking if y value is correctly selected
print(y)
Index(['Age', 'DistanceFromHome', 'Education', 'EnvironmentSatisfaction',
'Gender', 'JobInvolvement', 'JobSatisfaction', 'MonthlyRate',
'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
'RelationshipSatisfaction', 'StockOptionLevel', 'TrainingTimesLastYear',
'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
'YearsSinceLastPromotion', 'YearsWithCurrManager',
'BusinessTravel_Non-Travel', 'BusinessTravel_Travel_Frequently',
'BusinessTravel_Travel_Rarely', 'Department_Human Resources',
'Department_Research & Development', 'Department_Sales',
'MaritalStatus_Divorced', 'MaritalStatus_Married',
'MaritalStatus_Single', 'EducationField_Human Resources',
'EducationField_Life Sciences', 'EducationField_Marketing',
'EducationField_Medical', 'EducationField_Other',
'EducationField_Technical Degree'],
dtype='object')
0 1
1 0
2 1
3 0
4 0
..
1465 0
1466 0
1467 0
1468 0
1469 0
Name: Attrition, Length: 1470, dtype: int64
Now, to check whether the X and Y values are actually related (i.e., whether we can reject the null hypothesis), we can use an OLS summary.
Note: the null hypothesis states that the X and Y values share no relationship with each other.
To test it we look at the F-statistic of the OLS summary. If the value is 0, there is no relationship between the X and Y values; the greater the F-statistic, the stronger the evidence for a relationship.
### Splitting the data into train and test sets for both x and y. Specifying shuffle (which mixes the data) and the test size
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,shuffle = True)
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
### Dropping the target itself from the predictors; leaving JobSatisfaction in would let OLS trivially recover it with a coefficient of 1
x_value = data.drop(['Attrition', 'JobSatisfaction'], axis = 1)
linear_model = sm.OLS(data.JobSatisfaction, x_value).fit()
print(linear_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: JobSatisfaction R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 4.430e+29
Date: Sun, 09 Apr 2023 Prob (F-statistic): 0.00
Time: 22:47:15 Log-Likelihood: 45099.
No. Observations: 1470 AIC: -9.014e+04
Df Residuals: 1439 BIC: -8.997e+04
Df Model: 30
Covariance Type: nonrobust
=====================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------
Age 4.892e-16 1.61e-15 0.303 0.762 -2.68e-15 3.66e-15
DistanceFromHome -9.116e-16 1.05e-15 -0.867 0.386 -2.97e-15 1.15e-15
Education -1.414e-16 3.06e-16 -0.462 0.644 -7.42e-16 4.59e-16
EnvironmentSatisfaction 2.544e-16 2.79e-16 0.911 0.363 -2.94e-16 8.02e-16
Gender -2.828e-16 6.23e-16 -0.454 0.650 -1.51e-15 9.4e-16
JobInvolvement 1.587e-16 4.29e-16 0.370 0.711 -6.83e-16 1e-15
JobSatisfaction 1.0000 2.77e-16 3.62e+15 0.000 1.000 1.000
MonthlyRate 3.287e-16 1.07e-15 0.308 0.758 -1.76e-15 2.42e-15
NumCompaniesWorked -5.941e-17 1.32e-16 -0.449 0.654 -3.19e-16 2e-16
OverTime -1.154e-16 6.81e-16 -0.169 0.865 -1.45e-15 1.22e-15
PercentSalaryHike -4.344e-16 8.33e-17 -5.216 0.000 -5.98e-16 -2.71e-16
RelationshipSatisfaction 2.344e-16 2.83e-16 0.828 0.408 -3.21e-16 7.9e-16
StockOptionLevel 8.674e-17 4.89e-16 0.177 0.859 -8.72e-16 1.05e-15
TrainingTimesLastYear -2.248e-16 2.38e-16 -0.945 0.345 -6.91e-16 2.42e-16
WorkLifeBalance -1.592e-16 4.33e-16 -0.368 0.713 -1.01e-15 6.9e-16
YearsAtCompany -2.764e-16 9.63e-17 -2.870 0.004 -4.65e-16 -8.75e-17
YearsInCurrentRole -3.057e-17 1.38e-16 -0.222 0.825 -3.01e-16 2.4e-16
YearsSinceLastPromotion -2.627e-16 1.22e-16 -2.160 0.031 -5.01e-16 -2.41e-17
YearsWithCurrManager -1.108e-16 1.41e-16 -0.786 0.432 -3.87e-16 1.66e-16
BusinessTravel_Non-Travel -3.697e-17 1.12e-15 -0.033 0.974 -2.24e-15 2.17e-15
BusinessTravel_Travel_Frequently -2.272e-16 1.02e-15 -0.222 0.824 -2.23e-15 1.78e-15
BusinessTravel_Travel_Rarely -1.37e-16 9.14e-16 -0.150 0.881 -1.93e-15 1.66e-15
Department_Human Resources -2.511e-16 1.64e-15 -0.153 0.878 -3.47e-15 2.96e-15
Department_Research & Development 2.901e-16 1.06e-15 0.274 0.784 -1.79e-15 2.37e-15
Department_Sales 3.469e-17 1.13e-15 0.031 0.975 -2.17e-15 2.24e-15
MaritalStatus_Divorced -2.09e-16 1.06e-15 -0.197 0.844 -2.29e-15 1.88e-15
MaritalStatus_Married 3.452e-16 9.37e-16 0.368 0.713 -1.49e-15 2.18e-15
MaritalStatus_Single 5.109e-16 1.01e-15 0.507 0.612 -1.46e-15 2.49e-15
EducationField_Human Resources -1.926e-16 2.53e-15 -0.076 0.939 -5.15e-15 4.77e-15
EducationField_Life Sciences 3.454e-16 8.06e-16 0.429 0.668 -1.24e-15 1.93e-15
EducationField_Marketing 4.432e-16 1.18e-15 0.375 0.708 -1.88e-15 2.76e-15
EducationField_Medical -6.037e-16 8.52e-16 -0.709 0.478 -2.27e-15 1.07e-15
EducationField_Other -4.372e-16 1.31e-15 -0.335 0.738 -3e-15 2.12e-15
EducationField_Technical Degree -1.394e-16 1.1e-15 -0.127 0.899 -2.29e-15 2.01e-15
==============================================================================
Omnibus: 176.918 Durbin-Watson: 0.159
Prob(Omnibus): 0.000 Jarque-Bera (JB): 252.521
Skew: 0.894 Prob(JB): 1.46e-55
Kurtosis: 3.964 Cond. No. 1.11e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.86e-27. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In our dataset we are trying to predict the job satisfaction level from the input values, and the OLS summary lets us test the null hypothesis that no relationship exists between the input and output variables.
The F-statistic above is far from 0, which would normally suggest a relationship between the X and Y values. Note, however, that the R-squared of exactly 1.000 and the coefficient of exactly 1 on JobSatisfaction in this run indicate that the target itself was included among the predictors, so the apparent perfect fit reflects that leak rather than genuine predictive power.
Note: the F-statistic is a way to check whether there is a relationship between the X and Y values. If it is 0, there is no relationship between the variables; the higher the number, the more significant the relationship.
Further, we can look at the t-statistics and see that some variables significantly impact the output in both positive and negative directions. This can be analyzed further using SHAP. For the next step, we will use a simple linear regressor as our model and perform SHAP analysis on it.
### For analysis of the model's performance, we will include mean_squared_error
from sklearn.metrics import mean_squared_error
### We fit the train values for x and y to our linear model
Linear_regression = LinearRegression().fit(x_train, y_train)
### Getting the predictions on the test set
y_prediction = Linear_regression.predict(x_test)
### Rounding the predicted values to the nearest integer
y_prediction = np.rint(y_prediction)
### Extracting the coefficients used for the prediction
coeff = Linear_regression.coef_
### Placing the data in a dataframe for further analysis
analysis = pd.DataFrame()
analysis['Y_Predictions'] = y_prediction
analysis['Y_actual'] = y_test
analysis = analysis.dropna()
print(analysis)
#sns.residplot(x = analysis['Y_actual'], y = analysis['Y_Predictions'])
pad = pd.DataFrame()
### Placing the coefficients and their column names respectively in the dataframe, then printing out the mean squared error and R squared for model performance
pad['ColumnNames'] = x_train.columns
pad['Coefficients'] = coeff
print("The mean squared error is: {}".format(mean_squared_error(analysis['Y_actual'], analysis['Y_Predictions'])))
print("The R^2 score on train data is: {}".format(Linear_regression.score(x_train, y_train)))
print(pad)
### Plotting the coefficients
pad.plot.bar()
Y_Predictions Y_actual
0 0.0 1.0
4 0.0 0.0
8 0.0 0.0
12 -0.0 0.0
15 0.0 0.0
.. ... ...
351 0.0 0.0
353 0.0 0.0
354 0.0 0.0
355 0.0 0.0
358 0.0 0.0
[90 rows x 2 columns]
The mean squared error is: 0.2111111111111111
The R^2 score on train data is: 0.24095264654344173
ColumnNames Coefficients
0 Age -0.306338
1 DistanceFromHome 0.111671
2 Education -0.003506
3 EnvironmentSatisfaction -0.042771
4 Gender -0.028894
5 JobInvolvement -0.050978
6 JobSatisfaction -0.046996
7 MonthlyRate -0.009996
8 NumCompaniesWorked 0.014952
9 OverTime 0.207308
10 PercentSalaryHike -0.000959
11 RelationshipSatisfaction -0.021432
12 StockOptionLevel -0.024905
13 TrainingTimesLastYear -0.006229
14 WorkLifeBalance -0.021587
15 YearsAtCompany 0.003610
16 YearsInCurrentRole -0.010255
17 YearsSinceLastPromotion 0.012735
18 YearsWithCurrManager -0.012513
19 BusinessTravel_Non-Travel -0.069484
20 BusinessTravel_Travel_Frequently 0.083898
21 BusinessTravel_Travel_Rarely -0.014414
22 Department_Human Resources 0.015890
23 Department_Research & Development -0.029488
24 Department_Sales 0.013597
25 MaritalStatus_Divorced -0.036949
26 MaritalStatus_Married -0.029951
27 MaritalStatus_Single 0.066900
28 EducationField_Human Resources 0.016492
29 EducationField_Life Sciences -0.026858
30 EducationField_Marketing 0.008910
31 EducationField_Medical -0.045314
32 EducationField_Other -0.050188
33 EducationField_Technical Degree 0.096957
SHAP (SHapley Additive exPlanations) is a method for interpreting machine learning models. It explains a model's output by computing the contribution of each feature to the prediction, giving an estimate of each feature's importance for a particular instance. For our analysis we will plot summaries of the model, use waterfall plots to determine the importance of features and their impact on individual predictions, look at feature importance across all values of the model, use dependence plots to show how features relate to one another, and use a heatmap to see each feature's contribution to the model's predictions for a better understanding.
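Before plotting, a hedged sanity check of what SHAP values mean for a linear model (toy data, illustrative names; under the assumption of independent features, the SHAP value of feature j for instance x has the closed form coef_j * (x_j - mean(X_j)), and the contributions plus the base value reconstruct the prediction exactly):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3

model = LinearRegression().fit(X, y)

# Closed-form SHAP values for a linear model: coef_j * (x_j - E[x_j])
phi = model.coef_ * (X - X.mean(axis=0))
base = model.predict(X).mean()

# Additivity: base value + per-feature contributions == prediction
print(np.allclose(base + phi.sum(axis=1), model.predict(X)))  # prints True
```

This additivity is exactly what the waterfall plots below visualize, one instance at a time.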
### First we pass our model to the SHAP explainer to get Shapley values for the predictions
linear_explainer_shap = shap.LinearExplainer(Linear_regression, x_train)
### Next we pass the test values to that same explainer to see why the model reached its conclusions and which features drive them
shap_values_linear_regression = linear_explainer_shap(x_test)
### We plot the summary of the model's predictions as a bar plot
shap.summary_plot(shap_values_linear_regression, x_test, plot_type = 'bar', max_display=14)
From the plot above, we can see that the spread of Shapley values is greatest for YearsWithCurrManager. However, to determine the positive and negative importance of the values, we will omit the bar plot type and view the default summary of the SHAP values.
shap.summary_plot(shap_values_linear_regression, x_test, max_display = 14)
For the same parameter, YearsWithCurrManager, lower values push the model's prediction more strongly in the negative direction than high values push it in the positive direction.
In the graph, red dots indicate high feature values and blue dots indicate low feature values; the direction away from the center line indicates the direction of impact on the model's output value.
The only value whose impact becomes more positive the higher it gets appears to be YearsAtCompany.
print("Waterfall plot for linear regression")
### Here we specify the plot type and the single instance we want the plot to analyze
shap.plots.waterfall(shap_values_linear_regression[10], max_display = 14)
Waterfall plot for linear regression
Where the summary plot above gave the general distribution of Shapley values, the waterfall plot shows, for a single instance of the data, which values have a positive or negative impact on the output. This is useful for analyzing variable importance at the per-instance level.
Unlike the SHAP waterfall plot, which analyzes the contributions of individual features to a single prediction, the SHAP feature importance plot below aggregates the importance of each feature across all instances in the dataset. The plot displays a bar chart of the feature importance values, with the most important features listed at the top.
shap.plots.bar(shap_values_linear_regression)
### To get dependence plots for all the columns present in the dataset, we iterate over the columns and produce a dependence plot for each
for i in x_train.columns:
    shap.dependence_plot(i, shap_values_linear_regression.values, x_test)
SHAP dependence plots are used to visualize how the value of a particular feature affects the model's predictions. These plots can help you understand the relationship between a feature and the target variable, and how that relationship changes based on the values of other features in the dataset.
The SHAP dependence plot displays a scatter plot of the feature values and the corresponding SHAP values for each instance in the dataset. The plot shows how the feature value and the SHAP value are related, with the color of each point representing the value of another feature in the dataset.
The SHAP heatmap is useful for visualizing how different features interact with each other and how they contribute to the model's output. It can help identify which features matter most for the model's predictions and how those features change across observations, which is particularly useful for spotting patterns and trends in the data and understanding how the model makes decisions.
shap.plots.heatmap(shap_values_linear_regression)
From the heatmap above we can see, for individual observations, how YearsWithCurrManager varies, having both positive and negative impact on the final output of the model.
from sklearn import tree
import graphviz
### We initially specify the maximum depth the tree is allowed to grow to (note: despite the variable name, this is a DecisionTreeClassifier)
regressor = tree.DecisionTreeClassifier(random_state=0, max_depth=4)
regressor = regressor.fit(x_train, y_train)
predictions = regressor.predict(x_test)
### The classifier already outputs integer class labels; the rounding step is kept for consistency with the linear model
predictions = np.rint(predictions)
analysis = pd.DataFrame()
analysis['Y_Predictions'] = predictions
analysis['Y_actual'] = y_test
analysis = analysis.dropna()
print(analysis)
#sns.residplot(x = analysis['Y_actual'], y = analysis['Y_Predictions'])
print("The mean squared error is: {}".format(mean_squared_error(analysis['Y_actual'], analysis['Y_Predictions'])))
print("The train accuracy of the decision tree is: {}".format(regressor.score(x_train, y_train)))
### Visualizing the decision tree graph.
#dot_data = tree.export_graphviz(regressor, out_file=None, feature_names=x_train.columns, class_names=y_train.columns)
dot_data = tree.export_graphviz(regressor, out_file=None, feature_names=list(x_train.columns), filled=True)
graph = graphviz.Source(dot_data)
graph
     Y_Predictions  Y_actual
0              0.0       1.0
4              0.0       0.0
8              0.0       0.0
12             0.0       0.0
15             0.0       0.0
..             ...       ...
351            0.0       0.0
353            0.0       0.0
354            0.0       0.0
355            0.0       0.0
358            0.0       0.0

[90 rows x 2 columns]
The mean squared error is: 0.24444444444444444
The R^2 score on train data is: 0.24095264654344173
As above, we perform SHAP analysis for our decision tree model.
explainer = shap.Explainer(regressor.predict, x_train)
shap_values_decision_tree = explainer(x_train)
shap.summary_plot(shap_values_decision_tree, x_train, plot_type="bar")
shap.summary_plot(shap_values_decision_tree, x_train)
shap.plots.waterfall(shap_values_decision_tree[30], max_display=15)
shap.plots.bar(shap_values_decision_tree)
for i in x_train.columns:
    shap.dependence_plot(i, shap_values_decision_tree.values, x_train)
shap.plots.heatmap(shap_values_decision_tree)
### We start by initializing an H2O cluster; if any error is encountered we shut the cluster down
try:
    h2o.init()
except Exception:
    print("Unexpected error occurred")
    h2o.cluster().shutdown()
Checking whether there is an H2O instance running at http://localhost:54321. connected.
| H2O_cluster_uptime: | 2 hours 31 mins |
| H2O_cluster_timezone: | Etc/UTC |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.40.0.3 |
| H2O_cluster_version_age: | 4 days |
| H2O_cluster_name: | H2O_from_python_unknownUser_qs15wd |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 2.990 Gb |
| H2O_cluster_total_cores: | 2 |
| H2O_cluster_allowed_cores: | 2 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://localhost:54321 |
| H2O_connection_proxy: | {"http": null, "https": null, "colab_language_server": "/usr/colab/bin/language_service"} |
| H2O_internal_security: | False |
| Python_version: | 3.9.16 final |
### Here we convert our data into an H2OFrame. This data type allows easier manipulation and passing of the data for training
df = h2o.H2OFrame(data)
### Here we split the data into train and test sets (75/25) and print the training frame
df_train, df_test = df.split_frame([0.75])
print(df_train)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Age Attrition DistanceFromHome Education EnvironmentSatisfaction Gender JobInvolvement JobSatisfaction MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike RelationshipSatisfaction StockOptionLevel TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager BusinessTravel_Non-Travel BusinessTravel_Travel_Frequently BusinessTravel_Travel_Rarely Department_Human Resources Department_Research & Development Department_Sales MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single EducationField_Human Resources EducationField_Life Sciences EducationField_Marketing EducationField_Medical EducationField_Other EducationField_Technical Degree
0.547619 1 0 2 2 1 3 4 0.698053 8 1 11 1 0 0 1 6 4 0 5 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0
0.452381 1 0.0357143 2 4 0 2 3 0.0121261 6 1 15 2 0 3 3 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0
0.357143 0 0.0714286 4 4 1 3 3 0.845814 1 1 11 3 0 3 3 8 7 3 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0
0.97619 0 0.0714286 3 3 1 4 1 0.316001 4 1 20 1 3 3 2 1 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0
0.285714 0 0.821429 1 4 0 3 3 0.451355 1 0 22 2 1 2 3 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0
0.47619 0 0.785714 3 4 0 2 3 0.268741 0 0 21 2 0 2 3 9 7 1 8 0 1 0 0 1 0 0 0 1 0 1 0 0 0 0
0.428571 0 0.928571 3 3 0 3 3 0.58153 6 0 13 2 2 3 2 7 7 7 7 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0
0.380952 0 0.642857 2 2 0 3 4 0.267577 0 0 11 3 1 2 3 2 2 1 2 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0
0.261905 0 0.714286 4 2 1 4 1 0.325276 1 0 11 3 1 1 3 10 9 8 8 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0
0.333333 0 0.142857 2 1 0 4 2 0.520337 0 1 12 4 2 5 2 6 2 0 5 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0
[1091 rows x 35 columns]
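Note that `split_frame([0.75])` is randomized per row, so the realized split sizes only approximate the requested ratio: here the training frame has 1091 rows and the test frame 379 (reported later in the test-set metrics), which together make up the dataset's 1470 rows. A minimal pure-Python sketch of this per-row Bernoulli assignment (not H2O's actual implementation):

```python
import random

# split_frame assigns each row to a split independently at random, so the
# realized train fraction only approximates the requested 0.75.
random.seed(1)
n_rows = 1470  # 1091 train rows + 379 test rows in this notebook
train_rows = [r for r in range(n_rows) if random.random() < 0.75]

realized = len(train_rows) / n_rows
print(round(realized, 3))  # close to, but usually not exactly, 0.75
```
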
### Here we specify the predictor columns (x) and the response column (y).
x = df.columns
y = 'Attrition'
x.remove(y)  # list.remove() mutates in place and returns None, so don't reassign
### Since 'Attrition' is a binary target, we convert the response column to a factor so H2O treats this as a classification problem rather than a regression.
df_train[y] = df_train[y].asfactor()
df_test[y] = df_test[y].asfactor()
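One Python pitfall worth flagging in the cell above: `list.remove()` mutates the list in place and returns `None`, so its result must never be reassigned (writing `x = x.remove(y)` would leave `x` as `None`). A toy illustration with hypothetical column names:

```python
# Toy illustration (hypothetical column names): list.remove() works in place
# and returns None, so never reassign its result.
columns = ["Age", "Attrition", "DistanceFromHome", "OverTime"]
target = "Attrition"

wrong = list(columns)
result = wrong.remove(target)   # mutates `wrong`, returns None
print(result is None)           # True

# Non-mutating alternative: filter the target out.
right = [c for c in columns if c != target]
print(right)                    # ['Age', 'DistanceFromHome', 'OverTime']
print(wrong == right)           # True
```
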
### Finally we create an AutoML instance with a maximum runtime, setting balance_classes=True because, as we saw earlier, the two classes are imbalanced.
aml = H2OAutoML(max_runtime_secs=222, balance_classes=True, seed=1)
aml.train(x = x, y = y, training_frame = df_train)
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OStackedEnsembleEstimator : Stacked Ensemble
Model Key: StackedEnsemble_BestOfFamily_4_AutoML_3_20230409_02208
| key | value |
|---|---|
| Stacking strategy | cross_validation |
| Number of base models (used / total) | 5/6 |
| # GBM base models (used / total) | 1/1 |
| # XGBoost base models (used / total) | 1/1 |
| # GLM base models (used / total) | 1/1 |
| # DeepLearning base models (used / total) | 1/1 |
| # DRF base models (used / total) | 1/2 |
| Metalearner algorithm | GLM |
| Metalearner fold assignment scheme | Random |
| Metalearner nfolds | 5 |
| Metalearner fold_column | None |
| Custom metalearner hyperparameters | None |
ModelMetricsBinomialGLM: stackedensemble
** Reported on train data. **

MSE: 0.06708424630757806
RMSE: 0.25900626692722717
LogLoss: 0.23750592146249005
AUC: 0.9310165782739309
AUCPR: 0.8174902990490374
Gini: 0.8620331565478618
Null degrees of freedom: 1090
Residual degrees of freedom: 1085
Null deviance: 967.4113791384584
Residual deviance: 518.2379206311532
AIC: 530.2379206311532
| 0 | 1 | Error | Rate | |
|---|---|---|---|---|
| 0 | 872.0 | 42.0 | 0.046 | (42.0/914.0) |
| 1 | 49.0 | 128.0 | 0.2768 | (49.0/177.0) |
| Total | 921.0 | 170.0 | 0.0834 | (91.0/1091.0) |
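The error rates in this confusion matrix follow directly from the raw counts. A quick plain-Python sanity check:

```python
# Recompute the per-class and overall error rates from the confusion-matrix
# counts reported above on the training data.
tn, fp = 872, 42    # actual class 0: correct / misclassified
fn, tp = 49, 128    # actual class 1: misclassified / correct

class0_error = fp / (tn + fp)                   # 42 / 914
class1_error = fn / (fn + tp)                   # 49 / 177
total_error = (fp + fn) / (tn + fp + fn + tp)   # 91 / 1091

print(round(class0_error, 4))  # 0.046
print(round(class1_error, 4))  # 0.2768
print(round(total_error, 4))   # 0.0834
```
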
| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.3370898 | 0.7377522 | 125.0 |
| max f2 | 0.2533374 | 0.7749469 | 166.0 |
| max f0point5 | 0.4688383 | 0.8093797 | 95.0 |
| max accuracy | 0.4688383 | 0.9230064 | 95.0 |
| max precision | 0.9635126 | 1.0 | 0.0 |
| max recall | 0.0120916 | 1.0 | 383.0 |
| max specificity | 0.9635126 | 1.0 | 0.0 |
| max absolute_mcc | 0.4334891 | 0.6935376 | 100.0 |
| max min_per_class_accuracy | 0.2037646 | 0.8566740 | 194.0 |
| max mean_per_class_accuracy | 0.2533374 | 0.8642893 | 166.0 |
| max tns | 0.9635126 | 914.0 | 0.0 |
| max fns | 0.9635126 | 176.0 | 0.0 |
| max fps | 0.0014769 | 914.0 | 399.0 |
| max tps | 0.0120916 | 177.0 | 383.0 |
| max tnr | 0.9635126 | 1.0 | 0.0 |
| max fnr | 0.9635126 | 0.9943503 | 0.0 |
| max fpr | 0.0014769 | 1.0 | 399.0 |
| max tpr | 0.0120916 | 1.0 | 383.0 |
| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | kolmogorov_smirnov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0100825 | 0.9034193 | 6.1638418 | 6.1638418 | 1.0 | 0.9295806 | 1.0 | 0.9295806 | 0.0621469 | 0.0621469 | 516.3841808 | 516.3841808 | 0.0621469 |
| 2 | 0.0201650 | 0.8297520 | 6.1638418 | 6.1638418 | 1.0 | 0.8658675 | 1.0 | 0.8977240 | 0.0621469 | 0.1242938 | 516.3841808 | 516.3841808 | 0.1242938 |
| 3 | 0.0302475 | 0.7778185 | 6.1638418 | 6.1638418 | 1.0 | 0.8105521 | 1.0 | 0.8686667 | 0.0621469 | 0.1864407 | 516.3841808 | 516.3841808 | 0.1864407 |
| 4 | 0.0403300 | 0.7355400 | 6.1638418 | 6.1638418 | 1.0 | 0.7595903 | 1.0 | 0.8413976 | 0.0621469 | 0.2485876 | 516.3841808 | 516.3841808 | 0.2485876 |
| 5 | 0.0504125 | 0.6893542 | 6.1638418 | 6.1638418 | 1.0 | 0.7084455 | 1.0 | 0.8148072 | 0.0621469 | 0.3107345 | 516.3841808 | 516.3841808 | 0.3107345 |
| 6 | 0.1008249 | 0.4977528 | 4.8190036 | 5.4914227 | 0.7818182 | 0.5824157 | 0.8909091 | 0.6986114 | 0.2429379 | 0.5536723 | 381.9003595 | 449.1422702 | 0.5405432 |
| 7 | 0.1503208 | 0.3474040 | 3.0819209 | 4.6980502 | 0.5 | 0.4151475 | 0.7621951 | 0.6052758 | 0.1525424 | 0.7062147 | 208.1920904 | 369.8050158 | 0.6635451 |
| 8 | 0.2007333 | 0.2645546 | 1.9051875 | 3.9966463 | 0.3090909 | 0.3033519 | 0.6484018 | 0.5294501 | 0.0960452 | 0.8022599 | 90.5187468 | 299.6646286 | 0.7180148 |
| 9 | 0.3006416 | 0.1671354 | 0.8482351 | 2.9503755 | 0.1376147 | 0.2137292 | 0.4786585 | 0.4245307 | 0.0847458 | 0.8870056 | -15.1764889 | 195.0375500 | 0.6999159 |
| 10 | 0.4005500 | 0.1166096 | 0.5089411 | 2.3414136 | 0.0825688 | 0.1411636 | 0.3798627 | 0.3538510 | 0.0508475 | 0.9378531 | -49.1058933 | 134.1413593 | 0.6413542 |
| 11 | 0.5004583 | 0.0860542 | 0.2261960 | 1.9191449 | 0.0366972 | 0.0993268 | 0.3113553 | 0.3030394 | 0.0225989 | 0.9604520 | -77.3803970 | 91.9144885 | 0.5490734 |
| 12 | 0.6003666 | 0.0601386 | 0.0 | 1.5997757 | 0.0 | 0.0712955 | 0.2595420 | 0.2644744 | 0.0 | 0.9604520 | -100.0 | 59.9775736 | 0.4298174 |
| 13 | 0.7002750 | 0.0414830 | 0.2261960 | 1.4038069 | 0.0366972 | 0.0511607 | 0.2277487 | 0.2340409 | 0.0225989 | 0.9830508 | -77.3803970 | 40.3806904 | 0.3375366 |
| 14 | 0.8001833 | 0.0254370 | 0.1130980 | 1.2426531 | 0.0183486 | 0.0321108 | 0.2016037 | 0.2088286 | 0.0112994 | 0.9943503 | -88.6901985 | 24.2653102 | 0.2317682 |
| 15 | 0.9000917 | 0.0136238 | 0.0 | 1.1047211 | 0.0 | 0.0191969 | 0.1792261 | 0.1877798 | 0.0 | 0.9943503 | -100.0 | 10.4721139 | 0.1125122 |
| 16 | 1.0 | 0.0010811 | 0.0565490 | 1.0 | 0.0091743 | 0.0082873 | 0.1622365 | 0.1698470 | 0.0056497 | 1.0 | -94.3450993 | 0.0 | 0.0 |
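The lift column in the gains table can be reproduced from the counts above: lift is a group's response rate divided by the overall positive rate (177 attrition cases out of 1091 training rows, which is the 0.1622365 shown as the final cumulative_response_rate). A sketch:

```python
# Lift = a group's response rate divided by the overall positive rate.
positives, total = 177, 1091
overall_rate = positives / total   # ≈ 0.1622 (last cumulative_response_rate above)

# In the top-scoring group every row was a true positive (response_rate = 1.0):
top_group_lift = 1.0 / overall_rate
print(round(top_group_lift, 7))  # 6.1638418
```
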
ModelMetricsBinomialGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.09497797605451712
RMSE: 0.30818497052016847
LogLoss: 0.3252672555113068
AUC: 0.8325513975942341
AUCPR: 0.6076576875662298
Gini: 0.6651027951884683
Null degrees of freedom: 1090
Residual degrees of freedom: 1085
Null deviance: 967.6398783795136
Residual deviance: 709.7331515256715
AIC: 721.7331515256715
| 0 | 1 | Error | Rate | |
|---|---|---|---|---|
| 0 | 785.0 | 129.0 | 0.1411 | (129.0/914.0) |
| 1 | 55.0 | 122.0 | 0.3107 | (55.0/177.0) |
| Total | 840.0 | 251.0 | 0.1687 | (184.0/1091.0) |
| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.2276445 | 0.5700935 | 178.0 |
| max f2 | 0.1377375 | 0.6472197 | 242.0 |
| max f0point5 | 0.4863310 | 0.6330275 | 78.0 |
| max accuracy | 0.4863310 | 0.8799267 | 78.0 |
| max precision | 0.9697100 | 1.0 | 0.0 |
| max recall | 0.0012486 | 1.0 | 398.0 |
| max specificity | 0.9697100 | 1.0 | 0.0 |
| max absolute_mcc | 0.3821308 | 0.4971459 | 109.0 |
| max min_per_class_accuracy | 0.1554190 | 0.7627119 | 226.0 |
| max mean_per_class_accuracy | 0.2158388 | 0.7776150 | 185.0 |
| max tns | 0.9697100 | 914.0 | 0.0 |
| max fns | 0.9697100 | 176.0 | 0.0 |
| max fps | 0.0006290 | 914.0 | 399.0 |
| max tps | 0.0012486 | 177.0 | 398.0 |
| max tnr | 0.9697100 | 1.0 | 0.0 |
| max fnr | 0.9697100 | 0.9943503 | 0.0 |
| max fpr | 0.0006290 | 1.0 | 399.0 |
| max tpr | 0.0012486 | 1.0 | 398.0 |
| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | kolmogorov_smirnov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0100825 | 0.8740287 | 6.1638418 | 6.1638418 | 1.0 | 0.9146462 | 1.0 | 0.9146462 | 0.0621469 | 0.0621469 | 516.3841808 | 516.3841808 | 0.0621469 |
| 2 | 0.0201650 | 0.7934841 | 5.0431433 | 5.6034926 | 0.8181818 | 0.8207936 | 0.9090909 | 0.8677199 | 0.0508475 | 0.1129944 | 404.3143297 | 460.3492553 | 0.1108062 |
| 3 | 0.0302475 | 0.7221088 | 5.0431433 | 5.4167095 | 0.8181818 | 0.7606231 | 0.8787879 | 0.8320210 | 0.0508475 | 0.1638418 | 404.3143297 | 441.6709468 | 0.1594654 |
| 4 | 0.0403300 | 0.6881472 | 4.4827940 | 5.1832306 | 0.7272727 | 0.7059202 | 0.8409091 | 0.8004958 | 0.0451977 | 0.2090395 | 348.2794042 | 418.3230611 | 0.2013809 |
| 5 | 0.0504125 | 0.6426058 | 3.3620955 | 4.8190036 | 0.5454545 | 0.6650404 | 0.7818182 | 0.7734047 | 0.0338983 | 0.2429379 | 236.2095532 | 381.9003595 | 0.2298088 |
| 6 | 0.1008249 | 0.4270645 | 3.6983051 | 4.2586543 | 0.6 | 0.5238979 | 0.6909091 | 0.6486513 | 0.1864407 | 0.4293785 | 269.8305085 | 325.8654340 | 0.3921794 |
| 7 | 0.1503208 | 0.3289637 | 2.0546139 | 3.5329337 | 0.3333333 | 0.3745490 | 0.5731707 | 0.5583981 | 0.1016949 | 0.5310734 | 105.4613936 | 253.2933719 | 0.4544870 |
| 8 | 0.2007333 | 0.2587095 | 1.9051875 | 3.1241390 | 0.3090909 | 0.2893493 | 0.5068493 | 0.4908288 | 0.0960452 | 0.6271186 | 90.5187468 | 212.4138999 | 0.5089567 |
| 9 | 0.3006416 | 0.1715995 | 1.0744311 | 2.4429861 | 0.1743119 | 0.2083196 | 0.3963415 | 0.3969462 | 0.1073446 | 0.7344633 | 7.4431141 | 144.2986082 | 0.5178331 |
| 10 | 0.4005500 | 0.1194211 | 0.7916861 | 2.0311058 | 0.1284404 | 0.1406818 | 0.3295195 | 0.3330267 | 0.0790960 | 0.8135593 | -20.8313896 | 103.1105767 | 0.4929904 |
| 11 | 0.5004583 | 0.0872070 | 0.3392940 | 1.6933631 | 0.0550459 | 0.1014152 | 0.2747253 | 0.2867892 | 0.0338983 | 0.8474576 | -66.0705956 | 69.3363134 | 0.4141972 |
| 12 | 0.6003666 | 0.0603426 | 0.4523921 | 1.4868504 | 0.0733945 | 0.0727316 | 0.2412214 | 0.2511674 | 0.0451977 | 0.8926554 | -54.7607941 | 48.6850390 | 0.3488917 |
| 13 | 0.7002750 | 0.0418816 | 0.3958431 | 1.3311962 | 0.0642202 | 0.0508029 | 0.2159686 | 0.2225814 | 0.0395480 | 0.9322034 | -60.4156948 | 33.1196202 | 0.2768423 |
| 14 | 0.8001833 | 0.0257670 | 0.2827450 | 1.2002899 | 0.0458716 | 0.0333123 | 0.1947308 | 0.1989498 | 0.0282486 | 0.9604520 | -71.7254963 | 20.0289928 | 0.1913054 |
| 15 | 0.9000917 | 0.0143539 | 0.2261960 | 1.0921675 | 0.0366972 | 0.0199862 | 0.1771894 | 0.1790852 | 0.0225989 | 0.9830508 | -77.3803970 | 9.2167489 | 0.0990246 |
| 16 | 1.0 | 0.0005999 | 0.1696470 | 1.0 | 0.0275229 | 0.0079601 | 0.1622365 | 0.1619884 | 0.0169492 | 1.0 | -83.0352978 | 0.0 | 0.0 |
| mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid | |
|---|---|---|---|---|---|---|---|
| accuracy | 0.8619171 | 0.0353565 | 0.8812785 | 0.8584475 | 0.8807340 | 0.8018018 | 0.8873239 |
| auc | 0.8309743 | 0.0361391 | 0.8771904 | 0.7842946 | 0.8139499 | 0.8244461 | 0.8549906 |
| err | 0.1380829 | 0.0353565 | 0.1187215 | 0.1415525 | 0.1192661 | 0.1981982 | 0.1126761 |
| err_count | 30.2 | 8.136338 | 26.0 | 31.0 | 26.0 | 44.0 | 24.0 |
| f0point5 | 0.5980484 | 0.0823182 | 0.6521739 | 0.5529954 | 0.6418919 | 0.4745763 | 0.6686047 |
| f1 | 0.6072270 | 0.0353989 | 0.6176470 | 0.6075949 | 0.59375 | 0.56 | 0.6571429 |
| f2 | 0.6284139 | 0.0568205 | 0.5865922 | 0.6741573 | 0.5523256 | 0.6829268 | 0.6460674 |
| lift_top_group | 5.7512155 | 0.9013953 | 5.918919 | 6.6363635 | 6.0555553 | 4.2285714 | 5.9166665 |
| logloss | 0.3250895 | 0.0239453 | 0.2960839 | 0.3356675 | 0.3389248 | 0.3512427 | 0.3035282 |
| max_per_class_error | 0.3476986 | 0.1121399 | 0.4324324 | 0.2727273 | 0.4722222 | 0.2 | 0.3611111 |
| mean_per_class_error | 0.2221176 | 0.0288116 | 0.2436888 | 0.1955034 | 0.2608364 | 0.1989305 | 0.211629 |
| mse | 0.0949170 | 0.0071164 | 0.0880642 | 0.0956056 | 0.0994174 | 0.1039190 | 0.0875787 |
| null_deviance | 193.52797 | 4.760831 | 199.06012 | 186.0061 | 195.38908 | 193.53395 | 193.65062 |
| pr_auc | 0.6072177 | 0.0929142 | 0.6862503 | 0.5429464 | 0.6014975 | 0.4929664 | 0.7124276 |
| precision | 0.5969939 | 0.114819 | 0.6774194 | 0.5217391 | 0.6785714 | 0.4307692 | 0.6764706 |
| r2 | 0.2997116 | 0.0717751 | 0.3727874 | 0.2529586 | 0.2788899 | 0.2174877 | 0.3764347 |
| recall | 0.6523014 | 0.1121399 | 0.5675676 | 0.7272728 | 0.5277778 | 0.8 | 0.6388889 |
| residual_deviance | 141.94662 | 11.895875 | 129.68475 | 147.0224 | 147.77122 | 155.95175 | 129.30301 |
| rmse | 0.3079130 | 0.0115440 | 0.2967562 | 0.3092016 | 0.3153053 | 0.3223647 | 0.295937 |
| specificity | 0.9034634 | 0.0629861 | 0.9450549 | 0.8817204 | 0.9505494 | 0.8021390 | 0.9378531 |
[22 rows x 8 columns]
### Here we print the leaderboard, which ranks all trained models by cross-validated performance.
lb = aml.leaderboard
print(lb)
| model_id | auc | logloss | aucpr | mean_per_class_error | rmse | mse |
|---|---|---|---|---|---|---|
| StackedEnsemble_BestOfFamily_4_AutoML_3_20230409_02208 | 0.832551 | 0.325267 | 0.607658 | 0.225936 | 0.308185 | 0.094978 |
| StackedEnsemble_BestOfFamily_3_AutoML_3_20230409_02208 | 0.83163 | 0.324762 | 0.61299 | 0.252714 | 0.308019 | 0.094876 |
| StackedEnsemble_BestOfFamily_1_AutoML_3_20230409_02208 | 0.83096 | 0.328013 | 0.602189 | 0.237047 | 0.31005 | 0.0961308 |
| StackedEnsemble_AllModels_2_AutoML_3_20230409_02208 | 0.830088 | 0.327607 | 0.603268 | 0.248436 | 0.309847 | 0.0960054 |
| StackedEnsemble_AllModels_1_AutoML_3_20230409_02208 | 0.829074 | 0.329436 | 0.593937 | 0.239872 | 0.311104 | 0.0967857 |
| GLM_1_AutoML_3_20230409_02208 | 0.828969 | 0.329588 | 0.601384 | 0.286872 | 0.310004 | 0.0961023 |
| StackedEnsemble_BestOfFamily_5_AutoML_3_20230409_02208 | 0.827965 | 0.330358 | 0.592995 | 0.245791 | 0.31115 | 0.0968144 |
| StackedEnsemble_BestOfFamily_2_AutoML_3_20230409_02208 | 0.82484 | 0.332599 | 0.590864 | 0.272846 | 0.31144 | 0.0969947 |
| XGBoost_grid_1_AutoML_3_20230409_02208_model_27 | 0.820501 | 0.338062 | 0.552403 | 0.233317 | 0.318518 | 0.101454 |
| XGBoost_grid_1_AutoML_3_20230409_02208_model_1 | 0.818356 | 0.337356 | 0.562359 | 0.260283 | 0.317408 | 0.100748 |

[81 rows x 7 columns]
### For a quick look at the metrics, we print the performance of the leaderboard's top model on the test data.
model_imp = aml.leader
model_imp.model_performance(df_test)
ModelMetricsBinomialGLM: stackedensemble
** Reported on test data. **

MSE: 0.09595816938898644
RMSE: 0.3097711564832763
LogLoss: 0.33105234312506543
AUC: 0.8180250783699059
AUCPR: 0.5994292798981069
Gini: 0.6360501567398118
Null degrees of freedom: 378
Residual degrees of freedom: 373
Null deviance: 331.1824167648591
Residual deviance: 250.9376760887996
AIC: 262.9376760887996
| 0 | 1 | Error | Rate | |
|---|---|---|---|---|
| 0 | 297.0 | 22.0 | 0.069 | (22.0/319.0) |
| 1 | 27.0 | 33.0 | 0.45 | (27.0/60.0) |
| Total | 324.0 | 55.0 | 0.1293 | (49.0/379.0) |
| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.3786996 | 0.5739130 | 54.0 |
| max f2 | 0.1584724 | 0.6451613 | 131.0 |
| max f0point5 | 0.5142324 | 0.6382979 | 31.0 |
| max accuracy | 0.5142324 | 0.8839050 | 31.0 |
| max precision | 0.9454817 | 1.0 | 0.0 |
| max recall | 0.0095000 | 1.0 | 357.0 |
| max specificity | 0.9454817 | 1.0 | 0.0 |
| max absolute_mcc | 0.4846359 | 0.5004756 | 35.0 |
| max min_per_class_accuracy | 0.1738946 | 0.7492163 | 124.0 |
| max mean_per_class_accuracy | 0.1584724 | 0.7683386 | 131.0 |
| max tns | 0.9454817 | 319.0 | 0.0 |
| max fns | 0.9454817 | 59.0 | 0.0 |
| max fps | 0.0018311 | 319.0 | 378.0 |
| max tps | 0.0095000 | 60.0 | 357.0 |
| max tnr | 0.9454817 | 1.0 | 0.0 |
| max fnr | 0.9454817 | 0.9833333 | 0.0 |
| max fpr | 0.0018311 | 1.0 | 378.0 |
| max tpr | 0.0095000 | 1.0 | 357.0 |
| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | kolmogorov_smirnov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0105541 | 0.8223484 | 6.3166667 | 6.3166667 | 1.0 | 0.9033537 | 1.0 | 0.9033537 | 0.0666667 | 0.0666667 | 531.6666667 | 531.6666667 | 0.0666667 |
| 2 | 0.0211082 | 0.7563015 | 6.3166667 | 6.3166667 | 1.0 | 0.7854710 | 1.0 | 0.8444123 | 0.0666667 | 0.1333333 | 531.6666667 | 531.6666667 | 0.1333333 |
| 3 | 0.0316623 | 0.7067602 | 4.7375 | 5.7902778 | 0.75 | 0.7287966 | 0.9166667 | 0.8058738 | 0.05 | 0.1833333 | 373.75 | 479.0277778 | 0.1801985 |
| 4 | 0.0422164 | 0.6521768 | 4.7375 | 5.5270833 | 0.75 | 0.6755749 | 0.875 | 0.7732990 | 0.05 | 0.2333333 | 373.75 | 452.7083333 | 0.2270637 |
| 5 | 0.0501319 | 0.6211872 | 4.2111111 | 5.3192982 | 0.6666667 | 0.6344225 | 0.8421053 | 0.7513712 | 0.0333333 | 0.2666667 | 321.1111111 | 431.9298246 | 0.2572623 |
| 6 | 0.1002639 | 0.4736257 | 3.3245614 | 4.3219298 | 0.5263158 | 0.5432394 | 0.6842105 | 0.6473053 | 0.1666667 | 0.4333333 | 232.4561404 | 332.1929825 | 0.3957158 |
| 7 | 0.1503958 | 0.3628090 | 2.3271930 | 3.6570175 | 0.3684211 | 0.4179844 | 0.5789474 | 0.5708650 | 0.1166667 | 0.55 | 132.7192982 | 265.7017544 | 0.4747649 |
| 8 | 0.2005277 | 0.2832859 | 0.6649123 | 2.9089912 | 0.1052632 | 0.3213881 | 0.4605263 | 0.5084957 | 0.0333333 | 0.5833333 | -33.5087719 | 190.8991228 | 0.4548067 |
| 9 | 0.3007916 | 0.1953977 | 1.3298246 | 2.3826023 | 0.2105263 | 0.2333114 | 0.3771930 | 0.4167676 | 0.1333333 | 0.7166667 | 32.9824561 | 138.2602339 | 0.4940961 |
| 10 | 0.4010554 | 0.1315866 | 0.9973684 | 2.0362939 | 0.1578947 | 0.1572641 | 0.3223684 | 0.3518917 | 0.1 | 0.8166667 | -0.2631579 | 103.6293860 | 0.4937827 |
| 11 | 0.5013193 | 0.0795640 | 0.4986842 | 1.7287719 | 0.0789474 | 0.1032970 | 0.2736842 | 0.3021728 | 0.05 | 0.8666667 | -50.1315789 | 72.8771930 | 0.4340648 |
| 12 | 0.5989446 | 0.0559557 | 0.3414414 | 1.5026432 | 0.0540541 | 0.0668548 | 0.2378855 | 0.2638170 | 0.0333333 | 0.9 | -65.8558559 | 50.2643172 | 0.3576803 |
| 13 | 0.6992084 | 0.0392624 | 0.0 | 1.2871698 | 0.0 | 0.0471835 | 0.2037736 | 0.2327526 | 0.0 | 0.9 | -100.0 | 28.7169811 | 0.2385580 |
| 14 | 0.7994723 | 0.0234738 | 0.3324561 | 1.1674367 | 0.0526316 | 0.0302179 | 0.1848185 | 0.2073522 | 0.0333333 | 0.9333333 | -66.7543860 | 16.7436744 | 0.1590387 |
| 15 | 0.8997361 | 0.0147036 | 0.1662281 | 1.0558651 | 0.0263158 | 0.0192044 | 0.1671554 | 0.1863856 | 0.0166667 | 0.95 | -83.3771930 | 5.5865103 | 0.0597179 |
| 16 | 1.0 | 0.0018311 | 0.4986842 | 1.0 | 0.0789474 | 0.0087104 | 0.1583113 | 0.1685712 | 0.05 | 1.0 | -50.1315789 | 0.0 | 0.0 |
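One relationship worth knowing when reading these reports: H2O's Gini coefficient is simply a linear rescaling of AUC (Gini = 2·AUC − 1), which the test-set numbers above confirm:

```python
# Gini = 2 * AUC - 1; both values are from the test-set report above.
auc = 0.8180250783699059
gini = 2 * auc - 1
print(abs(gini - 0.6360501567398118) < 1e-12)  # True
```
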
To perform deeper analysis we can't use our leaderboard's top performer, the stacked ensemble, directly. This is because an ensemble combines the predictions of several different base models before making its final decision, which makes its behaviour hard to attribute to individual features.
Instead we will explain a single base model: the best GBM (Gradient Boosting Machine) from the run.
### As discussed above, we pick a single algorithm to explain; get_best_model(algorithm='gbm') returns the best-performing model among all the GBMs that AutoML trained.
model_to_explain = aml.get_best_model(algorithm='gbm')
explain = model_to_explain.explain(df_test)
Confusion matrix shows a predicted class vs an actual class.
| 0 | 1 | Error | Rate | |
|---|---|---|---|---|
| 0 | 309.0 | 10.0 | 0.0313 | (10.0/319.0) |
| 1 | 29.0 | 31.0 | 0.4833 | (29.0/60.0) |
| Total | 338.0 | 41.0 | 0.1029 | (39.0/379.0) |
Learning curve plot shows the loss function/metric dependent on number of iterations or trees for tree-based algorithms. This plot can be useful for determining whether the model overfits.
The variable importance plot shows the relative importance of the most important variables in the model.
SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which the PDP is computed and the rest.
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Hyperparameter grid for the GLM: regularization strength (lambda) and
# elastic-net mixing parameter (alpha).
glm_parameters = {"lambda": [i * 0.01 for i in range(1, 11)], "alpha": [i * 0.01 for i in range(1, 11)]}
# Note: search_criteria is defined here but not passed to H2OGridSearch below,
# so the full 10 x 10 Cartesian grid (100 models) is trained.
search_criteria = {"strategy": "RandomDiscrete", "max_models": 30, "seed": 1}
glm_grid = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'), hyper_params=glm_parameters)
glm_grid.train(x=x, y=y, training_frame=df_train)
glm_gridperf = glm_grid.get_grid(sort_by="rmse", decreasing=False)
print(glm_gridperf)
glm Grid Build progress: |███████████████████████████████████████████████████████| (done) 100%
Adding alpha array to hyperparameter runs slower with gridsearch. This is due to the fact that the algo has to run initialization for every alpha value. Setting the alpha array as a model parameter will skip the initialization and run faster overall.
Hyper-Parameter Search Summary: ordered by increasing rmse
alpha lambda model_ids rmse
--- ------- -------- ------------------------------------------------------------------ -------------------
0.01 0.01 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_1 0.2983337073109158
0.02 0.01 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_2 0.29837778572418516
0.03 0.01 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_3 0.29841485181972793
0.04 0.01 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_4 0.29845666335716997
0.05 0.01 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_5 0.298498659621854
0.06 0.01 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_6 0.2985428135561869
0.07 0.01 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_7 0.2985870435248281
0.08 0.01 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_8 0.2986328545611587
0.09 0.01 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_9 0.29868835825715434
0.1 0.01 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_10 0.29873750363442564
--- --- --- --- ---
0.1 0.08 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_80 0.32324533106527775
0.05 0.1 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_95 0.3234578927279676
0.08 0.09 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_88 0.3240593987620105
0.06 0.1 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_96 0.32440802934862745
0.09 0.09 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_89 0.32503769101581687
0.07 0.1 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_97 0.3254320639329165
0.1 0.09 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_90 0.3259384582347323
0.08 0.1 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_98 0.3265032379960393
0.09 0.1 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_99 0.3274818491529781
0.1 0.1 Grid_GLM_py_797_sid_ae9b_model_python_1680990637213_7732_model_100 0.3284789500742035
[100 rows x 5 columns]
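The summary's 100 rows confirm that the full Cartesian grid was trained (the RandomDiscrete `search_criteria` with `max_models=30` was defined but never passed to `H2OGridSearch`, so it had no effect). The grid size follows directly from the two hyperparameter lists:

```python
from itertools import product

# Full Cartesian grid: 10 lambda values x 10 alpha values = 100 models.
lambdas = [i * 0.01 for i in range(1, 11)]
alphas = [i * 0.01 for i in range(1, 11)]
grid = list(product(alphas, lambdas))
print(len(grid))  # 100
```
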
For this assignment we performed four major steps:
Data cleaning: We used a correlation matrix to detect multicollinearity and scaled the data so that the model doesn't suffer from overfitting.
Feature selection: Using the same correlation matrix, we determined which variables have a significant positive or negative relationship with our output variable, 'Attrition'.
Modelling: We trained a linear classifier, a decision tree, and AutoML models to compare performance. To improve performance further, we optimized the hyperparameters using H2O's grid search.
Interpretability: Interpretability is the practice of understanding a model's predictions in terms of its inputs. We used SHAP analysis to determine the significance of the input variables for the output, via the summary, dependence, waterfall, and heatmap plots shown earlier.
From the SHAP analysis we identified the relationships of the various variables: whether their impact on attrition is positive or negative, and by what degree.
We used the OLS summary to decide whether to reject the null hypothesis. As there was a significant relationship, we could predict attrition from the input variables. The performance of each model is measured by root mean squared error (RMSE); for our tests we used:
(Note: AutoML trains a collection of models and determines the best performer using the leaderboard.)
Below are the details and performance of the training:
| Model | Performance (RMSE) |
|---|---|
| Linear Model | 0.24095264654344173 |
| Decision tree classifier | 0.24095264654344173 |
| AutoML (StackedEnsemble_BestOfFamily_4_AutoML_3_20230409_02208) | 0.308185 |
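As a reminder of what the metric measures: RMSE = sqrt(mean((ŷ − y)²)), and for hard 0/1 predictions against 0/1 labels the MSE reduces to the misclassification rate (the decision tree's MSE of 0.2444 reported earlier is exactly 22 errors out of 90 test rows). A minimal sketch with hypothetical labels:

```python
import math

# Hypothetical 0/1 labels and hard 0/1 predictions: one error in five rows.
y_true = [1, 0, 0, 1, 0]
y_pred = [0, 0, 0, 1, 0]

# MSE = mean squared error; for 0/1 values each squared error is 0 or 1,
# so MSE equals the fraction of misclassified rows.
mse = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)
rmse = math.sqrt(mse)
print(mse)             # 0.2
print(round(rmse, 4))  # 0.4472
```
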
From this we can see that the models performed reasonably well; the AutoML stacked ensemble also achieved an AUC of about 0.82 on the test data.
From the above article we learned the concepts of data cleaning, feature selection, variable significance, model types, performance metrics of a model and its output, optimization of model performance, and interpreting a model's output based on the given input.
These key concepts will help in further analyzing the data and in determining the best model for the dataset. We can also apply the teachings from the article to further analyze a model's output, which in turn helps with understanding the hyperparameters and their tuning.
References
The references used for this article are as follows:
MIT License
Copyright © 2023 Tarush Ghanshyam Singh
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.